Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly. Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
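To make the three properties named in the abstract above concrete, here is a minimal sketch of an initialization rule that is linear in the number of data points (for fixed k), deterministic, and order-invariant. It uses a maximin-style rule chosen purely for illustration; the abstract does not describe the six methods it evaluates, so this should not be read as one of them. The function name `maximin_init` and the tie-breaking rule are assumptions.

```python
import math


def maximin_init(points, k):
    """Pick k initial centers with a maximin-style rule (illustrative only).

    Cost is O(n * k) distance evaluations, i.e. linear in n for fixed k.
    """
    pts = [tuple(p) for p in points]

    # First center: the point with the largest Euclidean norm.
    # Ties are broken by comparing coordinates, never list positions,
    # so shuffling the input cannot change the result (assumed rule).
    first = max(pts, key=lambda p: (math.hypot(*p), p))
    centers = [first]

    # min_dist[i] is the distance from point i to its nearest chosen center.
    min_dist = [math.dist(p, first) for p in pts]

    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        idx = max(range(len(pts)), key=lambda i: (min_dist[i], pts[i]))
        new_center = pts[idx]
        centers.append(new_center)
        # One linear pass refreshes the nearest-center distances.
        min_dist = [min(d, math.dist(p, new_center))
                    for d, p in zip(min_dist, pts)]

    return centers


data = [(0, 0), (1, 0), (10, 0), (10, 1), (0, 9)]
centers = maximin_init(data, 3)
shuffled_centers = maximin_init(list(reversed(data)), 3)
```

Because no random numbers are drawn and ties are resolved by coordinate values, `centers` and `shuffled_centers` are identical, so a single run is enough and the multiple restarts mentioned in the abstract are not needed for this kind of rule.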
Webometrics benefitting from web mining? An investigation of methods and applications of two research fields
Webometrics and web mining are two fields where research is focused on quantitative analyses of the web. This literature review outlines definitions of the fields, and then focuses on their methods and applications. It also discusses the potential of closer contact and collaboration between them. A key difference between the fields is that webometrics has focused on exploratory studies, whereas web mining has been dominated by studies focusing on development of methods and algorithms. Differences in the type of data can also be seen, with webometrics more focused on analyses of the structure of the web and web mining more focused on web content and usage, even though both fields have been embracing the possibilities of user-generated content. It is concluded that research problems where big data is needed can benefit from collaboration between webometricians, with their tradition of exploratory studies, and web miners, with their tradition of developing methods and algorithms.
Intelligent Routing System for a Personalised Electronic Tourist Guide
When tourists are at a destination, they typically seek information at the Local Tourist Organizations. There, the staff categorize each tourist's profile and restrictions. Combining this information with their up-to-date knowledge of the local attractions, weather and public transportation, they suggest a personalised route for the tourist's agenda. This paper presents an intelligent routing system for a Personalised Electronic Tourist Guide (PET) to fulfil the same task. This system improves the automatic route creation functionality of existing PETs to better meet the needs of tourists in several respects: i) it includes public transportation, ii) it takes varying travelling times into account, adapting to real circumstances such as rush hours, iii) it calculates routes in real time to react to unexpected events, iv) it applies latest-generation heuristics from Operations Research to create routes efficiently, even in destinations with a large number of points of interest and a dense public transportation network. Status: published
Selecting the Minkowski Exponent for Intelligent K-Means with Feature Weighting
Recently, a three-stage version of K-Means has been introduced, in which not only clusters and their centers but also feature weights are adjusted to minimize the sum of the p-th powers of the Minkowski p-distances between entities and the centroids of their clusters. The value of the Minkowski exponent p appears to be instrumental in the ability of the method to recover clusters hidden in data. This paper addresses the problem of finding the best p for a Minkowski metric-based version of K-Means in each of two settings: semi-supervised and unsupervised. It presents experimental evidence that solutions found with the proposed approaches are sufficiently close to the optimum. Peer reviewed
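The criterion described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each cluster has its own weight per feature and that a weight rescales the corresponding coordinate difference before the p-th power is taken, neither of which the abstract states. The names `minkowski_p_power` and `weighted_criterion` are hypothetical.

```python
def minkowski_p_power(x, c, w, p):
    """p-th power of the weighted Minkowski p-distance between x and c.

    Assumption: weight w[v] rescales feature v, so it enters as w[v] ** p.
    """
    return sum((wv ** p) * abs(xv - cv) ** p for xv, cv, wv in zip(x, c, w))


def weighted_criterion(points, labels, centroids, weights, p):
    """Sum of p-th powers of distances from each entity to its own centroid.

    labels[i] is the cluster index of points[i]; centroids[k] and weights[k]
    are the centroid and the (assumed per-cluster) feature weights of cluster k.
    """
    return sum(
        minkowski_p_power(x, centroids[k], weights[k], p)
        for x, k in zip(points, labels)
    )


# The same pair of points under different exponents and weights.
d_p2 = minkowski_p_power((0, 0), (3, 4), (1, 1), 2)
d_p1 = minkowski_p_power((0, 0), (3, 4), (1, 1), 1)
d_half = minkowski_p_power((0, 0), (3, 4), (0.5, 0.5), 2)

total = weighted_criterion(
    points=[(0, 0), (1, 1)],
    labels=[0, 1],
    centroids=[(3, 4), (1, 1)],
    weights=[(1, 1), (1, 1)],
    p=2,
)
```

With unit weights and p = 2 the quantity reduces to the squared Euclidean distance used by ordinary K-Means, which is why the choice of p changes which clusterings score well under the criterion.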
A Cluster Analysis Approach for Rule Base Reduction
In this paper we propose an iterative algorithm for fuzzy rule base simplification based on cluster analysis. The proposed approach uses a dissimilarity measure that allows different importance to be assigned to the values and ambiguities of fuzzy terms in the antecedent and consequent parts of fuzzy rules.